HTML API: Apply input preprocessing consistently at Tag Processor read boundaries #53
Open
sirreal wants to merge 19 commits into
Red TDD step: browser-verified expectations for raw CR/CRLF/NUL in attribute values; passing pins for character-reference-encoded CR and NULL and for verbatim pass-through of API-supplied values. See #65372.
Attribute values read from the input document now normalize newlines (CRLF/CR to LF) and replace U+0000 NULL bytes with U+FFFD before decoding character references, matching what browsers produce for the same markup. Values enqueued through set_attribute() are plaintext API values and continue to pass through unchanged. See #65372.
Red TDD step: flushing add_class()/remove_class() updates must read the existing class attribute through the same input preprocessing as get_attribute(), normalizing newlines and replacing NULL bytes. See #65372.
class_name_updates_to_attributes_updates() reads the existing class value through the same preprocessing helper as get_attribute(), so add_class()/remove_class() no longer rebuild the attribute from raw source bytes containing CR or NULL. See #65372.
Red TDD step: browser-verified expectations that attribute names are exposed and addressed with U+FFFD replacing NULL bytes, that names collapsing after replacement behave as duplicates of one attribute, and that attribute updates target the replaced name. See #65372.
Attribute lookup keys are normalized where they are created, in parse_next_attribute(): NULL bytes are replaced with U+FFFD before lowercasing, as the tokenizer does in browsers. Names which collapse to the same replaced name are duplicates of one attribute (first one wins), lookups by the raw NULL spelling no longer match, and updates or removals by the replaced name target the source attribute. Raw document spans are untouched. See #65372.
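The key normalization described in this commit can be sketched in isolation. This is a minimal illustration, not the Tag Processor's code: the function names `sketch_attribute_key()` and `sketch_collect_attributes()` are hypothetical, and the input is assumed to be a list of already-parsed raw name/value pairs in document order.

```php
<?php
/**
 * Hypothetical helper: builds the lookup key for a raw attribute name.
 * NULL bytes are replaced with U+FFFD first, then the name is lowercased.
 * (Assumes ASCII-only lowercasing, which is what strtolower() does on PHP 8.2+.)
 */
function sketch_attribute_key( string $raw_name ): string {
	return strtolower( str_replace( "\x00", "\u{FFFD}", $raw_name ) );
}

/**
 * Hypothetical helper: collects attributes so that the first occurrence of
 * a normalized key wins and later duplicates are ignored.
 *
 * @param array $raw_pairs List of [ raw name, value ] pairs in document order.
 * @return array<string, string> Values keyed by normalized name.
 */
function sketch_collect_attributes( array $raw_pairs ): array {
	$attributes = array();
	foreach ( $raw_pairs as [ $raw_name, $value ] ) {
		$key = sketch_attribute_key( $raw_name );
		if ( ! array_key_exists( $key, $attributes ) ) {
			$attributes[ $key ] = $value;
		}
	}
	return $attributes;
}
```

In this sketch a raw `da\x00ta` and a later `DA\u{FFFD}TA` land on the same key, so only the first value is kept, and a lookup by the raw NULL spelling finds nothing.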
Red TDD step: tag names are exposed with U+FFFD replacing NULL bytes; passing pins confirm NULL bytes never select rawtext parsing and never appear in PI-lookalike comment tag names. See #65372.
get_tag() (and get_token_name(), which delegates to it) returns tag names with U+0000 NULL bytes replaced by U+FFFD, as the tokenizer does in browsers. Internal token identification continues to compare raw bytes: a NULL byte in a tag name already prevents rawtext detection, matching browsers, where the replaced name likewise never equals SCRIPT or the other special names. See #65372.
Red TDD step: browser-verified expectation that classList-equivalent reads preserve NULL bytes in values set through the API; the U+0000 replacement belongs to the tokenizer, and document-sourced values already receive it in get_attribute(). See #65372.
class_list() received its NULL-byte replacement when reading raw class values; that replacement now happens in get_attribute() for values from the input document. Performing it on API-supplied values diverged from browsers, where classList preserves NULL bytes in values set via setAttribute(). See #65372.
Benchmark-guided: reading an attribute value applies up to three str_replace passes which doubled read cost for long values containing no bytes needing replacement. Guarding with strpos keeps the common case at two fast scans; values are typically free of CR and NULL. Benchmark (PHP 8.4, medians of 3): scanning 100-tag documents reading 3 attributes each, 2000 iterations: trunk 667ms, unguarded 714ms, guarded 699ms. Reading a 10.8KB clean attribute value 200k times: trunk 147ms, unguarded 313ms, guarded 258ms. The remaining cost is the unavoidable byte inspection. See #65372.
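The guard described above can be sketched as two equivalent functions. The names are hypothetical and the code only illustrates the technique (cheap `strpos()` scans before any `str_replace()` pass); it makes no claim about where the replacements sit relative to character-reference decoding.

```php
<?php
/**
 * Hypothetical unguarded version: always runs three replacement passes,
 * even when the value contains neither CR nor NULL.
 */
function sketch_replace_unguarded( string $value ): string {
	$value = str_replace( "\r\n", "\n", $value );
	$value = str_replace( "\r", "\n", $value );
	return str_replace( "\x00", "\u{FFFD}", $value );
}

/**
 * Hypothetical guarded version: two strpos() scans decide whether any
 * replacement pass is needed. A clean value is returned untouched.
 */
function sketch_replace_guarded( string $value ): string {
	if ( false !== strpos( $value, "\r" ) ) {
		// CRLF must be handled before lone CR so it yields a single LF.
		$value = str_replace( array( "\r\n", "\r" ), "\n", $value );
	}
	if ( false !== strpos( $value, "\x00" ) ) {
		$value = str_replace( "\x00", "\u{FFFD}", $value );
	}
	return $value;
}
```

Both functions are meant to return identical output for every input; only the cost on clean values differs.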
Red TDD step from adversarial review: a named character reference without a terminating semicolon must decode when followed by a NULL byte or any non-ASCII byte. Replacing NULL with U+FFFD before decoding fed the decoder a multi-byte follower whose classification by ctype_alnum() depends on the process locale, suppressing valid decodes in attribute values, diverging from browsers and from trunk. See #65372.
The tokenizer replaces U+0000 NULL bytes as it consumes input, so a character reference without a terminating semicolon sees the raw NULL byte as its follower, which is unambiguous, and the reference decodes. Replacing before decoding handed the decoder U+FFFD's lead byte, whose ctype_alnum() classification depends on the process locale, wrongly suppressing the decode under UTF-8 locales. No character reference decodes into NULL, so replacing after decoding is equivalent for the value's own bytes and faithful to the tokenizer's order. See #65372.
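The ordering argued for here can be sketched as a three-step pipeline. This is an illustration under assumptions: `sketch_read_source_value()` is a hypothetical name, and the decoder is passed in as a callable so that the sketch does not depend on any particular character-reference decoder.

```php
<?php
/**
 * Hypothetical helper: reads a raw attribute value taken from the input document.
 *
 * @param string   $raw    Raw bytes of the attribute value as found in the source.
 * @param callable $decode Character-reference decoder (string in, string out).
 */
function sketch_read_source_value( string $raw, callable $decode ): string {
	// 1. Newline normalization happens before decoding, so an encoded CR
	//    survives as a real CR while raw CR and CRLF become LF.
	$normalized = str_replace( array( "\r\n", "\r" ), "\n", $raw );

	// 2. The decoder still sees raw NULL bytes, so a semicolon-less
	//    reference followed by NULL has an unambiguous follower.
	$decoded = $decode( $normalized );

	// 3. NULL replacement happens last.
	return str_replace( "\x00", "\u{FFFD}", $decoded );
}
```

Values supplied through the API would bypass this helper entirely, since the commits above state that they pass through verbatim.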
Per the named-character-reference state, a semicolon-less reference is ambiguous only when followed by an ASCII alphanumeric or equals sign. ctype_alnum() classifies bytes 0x80 and above as alphanumeric under UTF-8 locales, wrongly suppressing decodes followed by any non-ASCII byte and making decoding depend on the process locale. See #65372.
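A locale-independent follower check along these lines might look as follows. The function name `sketch_is_ambiguous_follower()` is hypothetical; the rule itself (equals sign or ASCII alphanumeric) is the one quoted in the commit message.

```php
<?php
/**
 * Hypothetical helper: reports whether the byte at $at makes a semicolon-less
 * named character reference ambiguous inside an attribute value.
 *
 * Only "=" and ASCII alphanumerics count. Bytes 0x80 and above, NULL bytes,
 * and the end of the string never do, whatever the process locale is.
 */
function sketch_is_ambiguous_follower( string $text, int $at ): bool {
	if ( $at < 0 || $at >= strlen( $text ) ) {
		return false;
	}

	$code = ord( $text[ $at ] );

	return 0x3D === $code                     // "="
		|| ( $code >= 0x30 && $code <= 0x39 ) // 0-9
		|| ( $code >= 0x41 && $code <= 0x5A ) // A-Z
		|| ( $code >= 0x61 && $code <= 0x7A ); // a-z
}
```

Comparing byte values directly avoids `ctype_alnum()` and therefore any dependence on `setlocale()`.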
Red TDD step from adversarial review: next_tag() must match tag names in the same U+FFFD-replaced alphabet that get_tag() exposes, so the getter round-trips into queries, raw NULL spellings match nothing, and the Tag Processor agrees with the HTML Processor, whose queries already compare against the replaced token name. See #65372.
next_tag() compared sought tag names against raw document bytes while get_tag() returns names with NULL bytes replaced by U+FFFD, breaking the getter-to-query round trip and disagreeing with the HTML Processor's queries. Matching now happens in the exposed alphabet; the existing byte comparison is unchanged for names without NULL bytes, so the hot path costs the same. See #65372.
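Matching in the exposed alphabet while keeping the common path a plain byte comparison could be sketched like this. Both function names are hypothetical, and the sketch assumes that exposed tag names are uppercased, as the `DI\u{FFFD}V` example in the PR description suggests.

```php
<?php
/**
 * Hypothetical helper: the tag name as a getter would expose it, with
 * NULL bytes replaced by U+FFFD and ASCII letters uppercased.
 */
function sketch_exposed_tag_name( string $raw_tag_name ): string {
	return strtoupper( str_replace( "\x00", "\u{FFFD}", $raw_tag_name ) );
}

/**
 * Hypothetical helper: does the raw tag name in the document match the
 * sought name from a query?
 */
function sketch_tag_matches( string $raw_tag_name, string $sought ): bool {
	// Hot path: no NULL byte, so the raw bytes already are the exposed name
	// (up to ASCII case) and the comparison costs the same as before.
	if ( false === strpos( $raw_tag_name, "\x00" ) ) {
		return 0 === strcasecmp( $raw_tag_name, $sought );
	}

	// Slow path: compare in the exposed alphabet.
	return 0 === strcasecmp( sketch_exposed_tag_name( $raw_tag_name ), $sought );
}
```

With this shape, whatever `sketch_exposed_tag_name()` returns for a token can be fed straight back into `sketch_tag_matches()` for the same token, while a query spelled with a raw NULL byte matches nothing.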
Red TDD step from adversarial review: get_attribute( 'CLASS' ) returned a stale value when class updates were pending, because the flush guard compared the attribute name case-sensitively. See #65372.
Attribute lookups are ASCII-case-insensitive, but the pending-class flush in get_attribute() compared the requested name case-sensitively, returning a stale value for spellings like "CLASS". See #65372.
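The stale-read bug and its fix can be reproduced in miniature. The class below is a hypothetical stand-in, not the Tag Processor: it keeps attributes in an array, queues class additions, and flushes them when the class attribute is read under any ASCII casing.

```php
<?php
/** Hypothetical miniature of an attribute store with deferred class updates. */
final class Sketch_Attribute_Store {
	/** @var array<string, string> Attribute values keyed by lowercased name. */
	private $attributes = array();

	/** @var string[] Class names queued by add_class() and not yet flushed. */
	private $pending_classes = array();

	public function __construct( array $attributes ) {
		foreach ( $attributes as $name => $value ) {
			$this->attributes[ strtolower( $name ) ] = $value;
		}
	}

	public function add_class( string $class_name ): void {
		$this->pending_classes[] = $class_name;
	}

	public function get_attribute( string $name ): ?string {
		$key = strtolower( $name );

		// The guard compares the normalized key, so "CLASS" and "Class"
		// trigger the flush just like "class" does.
		if ( 'class' === $key && array() !== $this->pending_classes ) {
			$this->flush_class_updates();
		}

		return $this->attributes[ $key ] ?? null;
	}

	private function flush_class_updates(): void {
		$existing = $this->attributes['class'] ?? '';
		$classes  = '' === $existing ? array() : explode( ' ', $existing );

		foreach ( $this->pending_classes as $class_name ) {
			if ( ! in_array( $class_name, $classes, true ) ) {
				$classes[] = $class_name;
			}
		}

		$this->attributes['class'] = implode( ' ', $classes );
		$this->pending_classes     = array();
	}
}
```

Had the guard compared `$name` against `'class'` directly, a read of `'CLASS'` would skip the flush and return the value from before `add_class()` was called.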
From adversarial review: pins for class helpers over replaced source values, boolean attributes with NULL-byte names, verbatim prefix matching in get_attribute_names_with_prefix(), and HTML Processor end-tag matching across NULL and U+FFFD spellings (browser-verified: both spellings tokenize to the same name). Documents the @since 7.1.0 behavior on indirectly-affected getters and the known asymmetry of set_modifiable_text(), whose value reads back normalized unlike attribute values, which round-trip verbatim. See #65372.
sirreal added a commit that referenced this pull request on Jun 11, 2026 (conflicts: src/wp-includes/html-api/class-wp-html-tag-processor.php)
Problem

The Tag Processor defers the HTML spec's input-stream preprocessing (newline normalization) and the tokenizer's U+0000 NULL replacements while scanning, applying them when values are read out of the document. That deferral was implemented inconsistently: get_modifiable_text() and get_doctype_info() apply it, but attribute values, attribute names, and tag names do not. Found via differential fuzzing against browsers (and Dom\HTMLDocument); getters returned values no browser ever produces:

- get_attribute() value, raw \r / \r\n: returned unchanged; browsers produce \n.
- get_attribute() value, raw \x00: returned \x00; browsers produce U+FFFD.
- Attribute names da\x00ta + da\u{FFFD}ta: kept as distinct attributes; in browsers they collapse to one attribute, first wins.
- get_tag() with \x00: returned DI\x00V; browsers produce DI\u{FFFD}V.

A downstream consequence: CSS attribute selectors match wrong sets, including a false positive ([a="x\d y"] matches in WP, in no browser).

Changes
- get_attribute(): source values normalize \r\n / \r → \n before character-reference decoding (so a character reference for CR still decodes to a preserved CR) and replace \x00 → U+FFFD after decoding. Order is load-bearing: the decoder's semicolon-less-reference follower check must see the raw NULL byte, as the tokenizer does, and no character reference can produce a NULL, so post-decode replacement is equivalent for the value's own bytes. Values enqueued via set_attribute() remain verbatim, matching DOM setAttribute().
- Class updates: flushing add_class() / remove_class() reads the existing class value through the same preprocessing as get_attribute(), and the pending-class flush now matches the requested name ASCII-case-insensitively (get_attribute( 'CLASS' ) no longer returns a stale value).
- Attribute names: lookup keys replace \x00 → U+FFFD at the single point of creation (parse_next_attribute()). Names collapsing after replacement are duplicates (first wins; removal removes all spans), lookups by raw-NULL spelling miss, lookups by U+FFFD spelling hit, and set_attribute() / remove_attribute() target the replaced name. Raw document spans are untouched, so the lossless round-trip is preserved.
- Tag names: get_tag() / get_token_name() / get_qualified_tag_name() return names with \x00 → U+FFFD. Internal token identification still compares raw bytes (<scr\x00ipt> is not SCRIPT in browsers either). next_tag( array( 'tag_name' => … ) ) matches in the same exposed alphabet, so get_tag() round-trips into queries and the Tag Processor agrees with the HTML Processor.
- Character-reference decoder: the semicolon-less-reference follower check treats only ASCII alphanumerics and the equals sign as ambiguous. ctype_alnum() classifies bytes ≥ 0x80 as alphanumeric under UTF-8 locales, wrongly suppressing decodes (e.g. x&éy) and making output locale-dependent.
- class_list() no longer replaces NULL bytes itself: source values already arrive replaced via get_attribute(), and API-supplied values keep them, as Element.classList does.

Performance
get_attribute() is a hot path; replacements are guarded by strpos checks. Benchmarks (PHP 8.4, medians of 3): scanning 100-tag documents reading 3 attributes each ×2000: trunk 667 ms → 699 ms. Reading a 10.8 KB clean value ×200k: 147 ms → 258 ms (the residual cost is unavoidable byte inspection).

Testing
New suite tests/phpunit/tests/html-api/wpHtmlTagProcessor-input-preprocessing.php plus decoder tests; every behavioral change landed red-test-first with browser-verified expectations. Full html-api + html-api-html5lib-tests groups pass; html5lib results are byte-identical to trunk.

Follow-up
The serialization PR (#42) will be rebuilt on top of this branch: with correct getters, it reduces to serialize_decoded_text() (decoded CR → a CR character reference, for idempotent normalize()) and drops its get_attribute_for_serialization() workaround.

(Trac ticket 65372)